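Introduction to Triton Programming: From Threads to Program Instances

In Triton, the basic unit of execution shifts from CUDA's scalar thread to the program instance. A program instance is an abstraction over a GPU thread block: a single instance processes a vectorized "block" of elements at once.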

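1. Program Instance Identifier

Every execution unit obtains its own identifier with pid = tl.program_id(axis=0). Think of a warehouse: a program instance is like a forklift that lifts a whole pallet, a block of 128 boxes, in one go, whereas an individual CUDA thread is like a single worker who lifts boxes one at a time. In this picture, the pid plays the role of the aisle number that tells each forklift which pallet is its own.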
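To make the pallet picture concrete, here is a minimal host-side sketch in plain Python (not Triton itself) of how a pid is typically turned into the element indices that one instance owns. It mirrors the common Triton expression pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE). The helper names block_offsets and num_programs are hypothetical, and BLOCK_SIZE = 128 is simply taken from the analogy.

```python
BLOCK_SIZE = 128  # boxes per pallet, taken from the forklift analogy


def block_offsets(pid, block_size=BLOCK_SIZE):
    """Element indices handled by program instance `pid` (hypothetical helper).

    Host-side model of: pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    """
    start = pid * block_size
    return list(range(start, start + block_size))


def num_programs(n_elements, block_size=BLOCK_SIZE):
    """How many instances are needed to cover n_elements (ceiling division)."""
    return (n_elements + block_size - 1) // block_size


if __name__ == "__main__":
    # Instance 1 owns the second pallet: elements 128 through 255.
    offsets = block_offsets(1)
    print(offsets[0], offsets[-1])
    # 1000 elements need 8 instances; the last one is only partly full.
    print(num_programs(1000))
```

Because the last block can run past the end of the data, a real kernel usually pairs these offsets with a bounds mask on its loads and stores.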

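2. Triton Tensors vs. PyTorch Tensors

Understanding the semantic difference between the two is important for reasoning about memory:

PyTorch tensor: a host-side Python object that wraps device storage (VRAM for a CUDA tensor) together with strides and other metadata.
Triton tensor: a compiler-level object that represents a block of values or pointers held in registers or SRAM.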

PyTorch view
A Python object that points to locations in global memory (contiguous by default).

Triton view
A 1D or 2D block of data that lives in registers managed by the compiler.

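3. The SPMD Property

Triton follows the Single Program, Multiple Data (SPMD) model. Every program instance executes exactly the same code. Instances diverge only where the logic uses pid, typically to compute instance-specific memory offsets.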
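The SPMD idea can be sketched on the host in plain Python. This is an emulation under simplifying assumptions, not Triton code: launch runs the instances one after another, whereas real instances run in parallel with no guaranteed order. All function names here are hypothetical. The second kernel drops the pid term to show what happens when every instance computes the same offsets.

```python
def add_kernel(pid, x, y, out, n_elements, block_size):
    """Kernel body shared by every program instance; only `pid` differs."""
    for lane in range(block_size):        # stands in for tl.arange(0, BLOCK_SIZE)
        offset = pid * block_size + lane  # pid-specific memory offset
        if offset < n_elements:           # bounds guard, like a load/store mask
            out[offset] = x[offset] + y[offset]


def add_kernel_no_pid(pid, x, y, out, n_elements, block_size):
    """Same body with the pid term dropped: every instance targets block 0."""
    for lane in range(block_size):
        offset = lane                     # bug: pid is ignored
        if offset < n_elements:
            out[offset] = x[offset] + y[offset]


def launch(kernel, grid, *args):
    """Run the same kernel body once per program instance (sequentially)."""
    for pid in range(grid):
        kernel(pid, *args)


x = list(range(10))
y = [10] * 10
good = [0] * 10
bad = [0] * 10

# 10 elements with a block size of 4 need 3 instances.
launch(add_kernel, 3, x, y, good, 10, 4)
launch(add_kernel_no_pid, 3, x, y, bad, 10, 4)

print(good)  # every element written exactly once
print(bad)   # only the first block written, three times over
```

In the broken variant all three instances overwrite the same four elements and the rest of the output is never touched, which is the failure mode that pid-based offsets exist to prevent.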

QUESTION 1

What is the primary identifier for a Triton execution unit?

threadIdx.x

tl.program_id(axis=0)

tl.block_idx()

torch.get_id()

QUESTION 2

True or False: A Triton tensor is a Python object that stores metadata like strides on the host CPU.

True

False

QUESTION 3

What is the result of 'forgetting that all program instances execute the same kernel body'?

The compiler will automatically distribute tasks.

Race conditions or overwriting memory if pid-based logic is missing.

The kernel will fail to compile due to a syntax error.

Execution time will double.

QUESTION 4

In the forklift analogy, what does the 'Aisle Number' represent?

The BLOCK_SIZE

The program_id (pid)

The GPU Driver version

The Pointer address

QUESTION 5

Why is the Triton model considered 'Vectorized' compared to CUDA?

It uses Python lists.

One Program Instance handles a block of elements, not just one scalar element.

It only works with 2D matrices.

It runs on the CPU's SIMD units.